The avid movie watcher instantly recognizes these iconic lines, each one imprinting smiles and memories across generations, etched into film history. But are movies like this dying? Are we, as audiences, to blame for growing too selective? Has filmmaking itself changed? Or is there something more complex at play?
This analysis aims to answer these questions using data and machine learning 1. While some may fear that analytics could turn filmmaking into an equation, understanding the key factors that influence audience perception will always be important. The role of AI in filmmaking, as in any industry, is not to replace creativity but to enhance understanding. These tools are just that: not blueprints for high-quality storytelling, but lenses through which we can analyze trends, audience preferences, and the shifting landscape of cinema.
Data Collection & Cleaning
A strong analysis starts with a strong dataset. At first, we considered using a pre-scraped IMDb dataset of 1,000 movies, but the limitations quickly became clear. The sample was too small, and there was no way to confirm that the movies had been selected randomly (a sample of only the best or most recent movies would not be representative). To ensure a dataset built for real insights, we needed to collect our own data. This gave us full control over what was included, reduced the need for fixes such as removing formatting inconsistencies, and gave us more confidence in generalizing our results to the wider movie population.
Initially, our scraping2 approach was straightforward: pull random IMDb movie IDs and filter out bad data afterward. But as we worked through the dataset, a clear pattern emerged. Movies with missing ratings and descriptions almost always had low vote counts. Instead of collecting first and filtering later, we restructured the process. Our scraper first checked a movie’s vote count using IMDbPY. Only if the votes exceeded 5,000 did it proceed to retrieve the full movie data. This approach ensured that our dataset was built on movies that have garnered significant audience engagement, eliminating the need for excessive wrangling3 and reducing the likelihood of bias from underseen, low-quality films.
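The check-then-fetch flow described above can be sketched as follows. This is a minimal illustration rather than the scraper actually used: `collect_movies`, `get_vote_count`, and `get_full_record` are hypothetical names, and the two lookups are passed in as plain functions so the sketch does not assume any particular IMDbPY call.

```python
from typing import Callable, Iterable, Optional

MIN_VOTES = 5_000  # threshold stated in the text


def collect_movies(
    candidate_ids: Iterable[str],
    get_vote_count: Callable[[str], Optional[int]],
    get_full_record: Callable[[str], dict],
    min_votes: int = MIN_VOTES,
) -> list[dict]:
    """Fetch full records only for IDs whose vote count exceeds min_votes."""
    kept = []
    for movie_id in candidate_ids:
        votes = get_vote_count(movie_id)  # cheap lookup first
        if votes is None or votes <= min_votes:
            continue  # skip before the expensive full fetch
        kept.append(get_full_record(movie_id))
    return kept


# Toy stand-ins for the real lookups (made-up IDs and vote counts)
fake_votes = {"tt0000001": 12, "tt0000002": 48_000, "tt0000003": None}
fetched = []


def fake_full_record(movie_id: str) -> dict:
    fetched.append(movie_id)  # record which IDs triggered a full fetch
    return {"IMDB_ID": movie_id, "Votes": fake_votes[movie_id]}


movies = collect_movies(fake_votes, fake_votes.get, fake_full_record)
```

In this toy run only the second ID clears the threshold, so it is the only one that triggers the full fetch.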
After cleaning and reshaping the data (engineering a few new features and standardizing variables for analysis), each row represented a single movie, giving us over 9,660 movies ready for exploration. Here’s a snapshot of Forrest Gump, classified as an Epic, with over 2.3 million votes, a 34-word description, and a standout IMDb rating of 8.8.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Markdown, display
import plotly.io as pio

pio.renderers.default = "iframe_connected"  # allows for stable Plotly rendering for HTML export

# Set pandas to display the full content of columns
pd.set_option('display.max_colwidth', None)

# Read the data into our dataframe
df = pd.read_csv("/Users/rileysvensson/Desktop/CAA - Data Analyst/Projects/imdb_data/final/IMDB_clean/IMDB.csv")

# Clean the 'Genres' column by removing "Back to top" text from our scraping
df["Genres"] = df["Genres"].str.replace("Back to top", "", regex=False).str.strip()

# Define the set of well-known genres to extract from niche genres
standard_genres = {
    "Action", "Adult", "Adventure", "Animation", "Biography", "Comedy", "Crime", "Documentary",
    "Drama", "Family", "Fantasy", "Film Noir", "Game Show", "History", "Horror", "Musical",
    "Music", "Mystery", "News", "Reality-TV", "Romance", "Sci-Fi", "Short", "Sport",
    "Talk-Show", "Thriller", "War", "Western"
}

# Function to strictly extract standard genres if they appear anywhere in the genre string
def extract_standard_genres(genre_str):
    if pd.isna(genre_str):
        return None  # Handle missing values
    genre_str_lower = genre_str.lower()  # Convert to lowercase for matching
    # Note: a set has no guaranteed order, so the joined genres may not follow IMDb's listing order
    filtered_genres = {std_genre for std_genre in standard_genres if std_genre.lower() in genre_str_lower}
    return ", ".join(filtered_genres) if filtered_genres else None

# Apply the strict genre extraction function
df["Cleaned_Genres"] = df["Genres"].apply(extract_standard_genres)

# Create a binary column for "Highly Rated" movies (IMDb rating >= 7.0)
df["Highly_Rated"] = (df["Rating"] >= 7).astype(int)

# Function to clean and deduplicate actor names within each row, handling NaNs
def clean_stars_column(stars):
    if pd.isna(stars):
        return ""
    # Split names by commas, strip spaces, and remove duplicates (sorted alphabetically)
    unique_names = sorted(set(name.strip() for name in stars.split(",")))
    # Rejoin into a single, clean string
    return ", ".join(unique_names)

# Apply function to "Stars" column
df["Stars"] = df["Stars"].apply(clean_stars_column)

# Create function to count the number of words in the movie description
def count_words(description):
    # Check if "Read all", "..." or "--" is in the text and remove truncated parts
    truncated = False
    for marker in ["Read all", "...", "--"]:
        if marker in description:
            description = description.split(marker)[0].strip()
            truncated = True
    # Count words in the cleaned description
    word_count = len(description.split())
    # Indicate if the description was truncated by adding "+"
    return word_count if not truncated else f"{word_count}+"

# Apply function to Description column
df["Description Word Count"] = df["Description"].apply(count_words)

# Create function to bin word count
def categorize_word_count(word_count):
    count = int(str(word_count).replace("+", ""))
    return (
        "<5" if count < 5 else
        "5-10" if count < 10 else
        "10-15" if count < 15 else
        "15-20" if count < 20 else
        "20-25" if count < 25 else
        "25-30" if count < 30 else
        "30+"
    )

# Apply binning function
df["Word_Count_Binned"] = df["Description Word Count"].apply(categorize_word_count)

# Count number of unique genres
df["Num_Genres"] = df["Cleaned_Genres"].apply(lambda x: len(x.split(",")) if pd.notna(x) and x.strip() else 0)

# Keep only rows where 'Year' starts with 19 or 20
df = df[df["Year"].astype(str).str.match(r"^(19|20)\d{2}$")]

# Convert Year to numeric
df["Year"] = pd.to_numeric(df["Year"])

# Function to extract the first genre from "Cleaned_Genres"
def extract_primary_genre(genre_str):
    if pd.isna(genre_str) or genre_str.strip() == "":
        return None  # Handle missing values
    return genre_str.split(",")[0].strip()  # Take the first genre

# Apply function to create the "Cleaned_Primary_Genre" column
df["Cleaned_Primary_Genre"] = df["Cleaned_Genres"].apply(extract_primary_genre)

# Create a new column: 1 if the director of a movie is also one of its writers, 0 otherwise
def director_is_writer(row):
    director = str(row['Director']).strip().lower()
    writers = str(row['Writers']).split(',')
    writers_cleaned = [w.strip().lower() for w in writers]
    return int(director in writers_cleaned)

df['Director_Is_Writer'] = df.apply(director_is_writer, axis=1)

# Drop movies with a duration greater than 300 minutes, as these were very obscure outliers
df = df[df["Duration"] <= 300]

# Display Forrest Gump as an example record
display_movie_ex = df[df['Title'].isin(['Forrest Gump', ''])]
display_movie_ex
Example record (row index 7519):

| Column | Value |
|---|---|
| IMDB_ID | tt0109830 |
| Title | Forrest Gump |
| Stars | Gary Sinise, Robin Wright, Tom Hanks |
| Year | 1994 |
| Genres | Epic, Drama, Romance, |
| Description | The history of the United States from the 1950s to the '70s unfolds from the perspective of an Alabama man with an IQ of 75, who yearns to be reunited with his childhood sweetheart. |
| Certificate | PG-13 |
| Writers | Winston Groom, Eric Roth |
| Director | Robert Zemeckis |
| Primary Genre | Epic |
| Votes | 2359386 |
| Rating | 8.8 |
| Duration | 142.0 |
| Cleaned_Genres | Romance, Drama |
| Highly_Rated | 1 |
| Description Word Count | 34 |
| Word_Count_Binned | 30+ |
| Num_Genres | 2 |
| Cleaned_Primary_Genre | Romance |
| Director_Is_Writer | 0 |
Get to Know the Cast: Data Definition
Every film has its cast of characters, and so does every dataset. This section introduces the key variables behind each movie in the analysis, from obvious stars like IMDb rating and genre to quieter supporting roles like description length and certificate. Understanding these features4 lays the groundwork for later modeling, helping us interpret which factors might truly shape audience opinion.
import pandas as pd
from IPython.display import HTML

# Define glossary as a list of dictionaries
glossary_data = [
    {"Variable": "Title", "Description": "The name of the movie"},
    {"Variable": "Year", "Description": "The year the movie was released"},
    {"Variable": "Genres", "Description": "All genres from IMDb"},
    {"Variable": "Cleaned_Genres", "Description": "Filtered version of `Genres` with only standard genres"},
    {"Variable": "Cleaned_Primary_Genre", "Description": "First genre listed in `Cleaned_Genres`"},
    {"Variable": "Rating", "Description": "IMDb rating out of 10"},
    {"Variable": "Votes", "Description": "Total number of IMDb votes received (as of March 2025)"},
    {"Variable": "Duration", "Description": "Runtime in minutes"},
    {"Variable": "Certificate", "Description": "Age suitability rating (e.g., PG, PG-13, R)"},
    {"Variable": "Stars", "Description": "List of lead actors"},
    {"Variable": "Writers", "Description": "List of lead writers"},
    {"Variable": "Director", "Description": "Main director"},
    {"Variable": "Description", "Description": "IMDb plot summary"},
    {"Variable": "Description Word Count", "Description": "Word count of the description"},
    {"Variable": "Word_Count_Binned", "Description": "Binned version of description length (e.g., <5, 10-15, 30+)"},
    {"Variable": "Num_Genres", "Description": "Count of genres per movie"},
    {"Variable": "Highly_Rated", "Description": "1 if movie’s rating is 7.0 or higher, otherwise 0"},
    {"Variable": "Director_Is_Writer", "Description": "1 if the director is also a writer (creative control)"},
]

# Build the DataFrame from the glossary entries (this step was missing from the listing)
glossary_df = pd.DataFrame(glossary_data)

# Convert DataFrame to HTML without index
glossary_html = glossary_df.to_html(index=False, escape=False)

# Display
display(HTML(glossary_html))
| Variable | Description |
|---|---|
| Title | The name of the movie |
| Year | The year the movie was released |
| Genres | All genres from IMDb |
| Cleaned_Genres | Filtered version of `Genres` with only standard genres |
| Cleaned_Primary_Genre | First genre listed in `Cleaned_Genres` |
| Rating | IMDb rating out of 10 |
| Votes | Total number of IMDb votes received (as of March 2025) |
| Duration | Runtime in minutes |
| Certificate | Age suitability rating (e.g., PG, PG-13, R) |
| Stars | List of lead actors |
| Writers | List of lead writers |
| Director | Main director |
| Description | IMDb plot summary |
| Description Word Count | Word count of the description |
| Word_Count_Binned | Binned version of description length (e.g., <5, 10-15, 30+) |
| Num_Genres | Count of genres per movie |
| Highly_Rated | 1 if movie’s rating is 7.0 or higher, otherwise 0 |
| Director_Is_Writer | 1 if the director is also a writer (creative control) |
Exploratory Data Analysis
Through exploratory data analysis (EDA), we let the data speak for itself while also applying our intuition to guide initial insights. EDA is a crucial first step in any analysis, as it helps surface hidden patterns, validate assumptions, and frame the questions worth asking as we dig deeper into the data. With this approach in mind, we turned to the IMDb dataset to uncover what shapes audience ratings. One of the first questions we asked was whether the sheer volume of films produced each year could be impacting how audiences perceive and rate them.
import plotly.graph_objects as go
import pandas as pd
import plotly.io as pio

pio.renderers.default = "plotly_mimetype+notebook"

# Filter out invalid years (keeping only reasonable movie years)
df_movie_count_over_time = df[(df['Year'] >= 1940) & (df['Year'] <= 2023)]

# Count movies per year
movie_counts = df_movie_count_over_time["Year"].value_counts().sort_index()
df_movies = pd.DataFrame({'Year': movie_counts.index, 'Count': movie_counts.values})

fig = go.Figure(go.Bar(
    x=df_movies['Year'],
    y=df_movies['Count'],
    marker_color="#E6B91E"
))

fig.update_layout(
    plot_bgcolor="black",
    paper_bgcolor="black",
    title=dict(text="🎬 Movies by Year", font=dict(size=26, color="gold"), x=0.5),
    xaxis=dict(
        title="Year",
        tickmode='linear',
        dtick=5,
        tickfont=dict(size=14, color='white'),
        title_font=dict(size=16, color='white'),
        linecolor='white'
    ),
    yaxis=dict(
        title="Number of Movies",
        gridcolor="white",
        tickfont=dict(size=14, color='white'),
        title_font=dict(size=16, color='white'),
        linecolor='white'
    ),
    showlegend=False
)

fig
import plotly.graph_objects as go
import pandas as pd
import plotly.io as pio

# Use this to render inside Quarto or Jupyter Notebook
pio.renderers.default = "plotly_mimetype+notebook"

# Filter out invalid years
df_rating_over_time = df[(df['Year'] >= 1940) & (df['Year'] <= 2023)]

# Calculate average rating per year
avg_rating_per_year = df_rating_over_time.groupby("Year")["Rating"].mean().reset_index()

# Create line chart
fig = go.Figure()
fig.add_trace(go.Scatter(
    x=avg_rating_per_year["Year"],
    y=avg_rating_per_year["Rating"],
    mode="lines+markers",
    marker=dict(color="red", size=6),
    line=dict(color="red", width=2),
    name="Average Rating"
))

# Update layout to match the bar chart theme
fig.update_layout(
    plot_bgcolor="black",
    paper_bgcolor="black",
    title=dict(text="⭐ Average IMDb Rating Over Time", font=dict(size=26, color="gold"), x=0.5),
    xaxis=dict(
        title="Year",
        tickmode='linear',
        dtick=5,
        tickfont=dict(size=14, color='white'),
        title_font=dict(size=16, color='white'),
        linecolor='white',
        gridcolor='white'
    ),
    yaxis=dict(
        title="Average Rating",
        range=[5, 9],
        tickfont=dict(size=14, color='white'),
        title_font=dict(size=16, color='white'),
        linecolor='white',
        gridcolor='white'
    ),
    showlegend=False
)

fig
Our intuition wasn’t far off.
These figures tell a compelling story. As more movies have been made in recent decades, average ratings have steadily declined. While correlation doesn’t imply causation, this pattern raises several key questions:
Has the rapid increase in film production diluted overall quality?
Have audiences become more critical and selective in a world of endless choice?
Or have audiences grown less mindful of the time, effort, and money put into every film?
It’s also possible that modern films face a tougher climb. Older classics that remain in circulation today may reflect a kind of survivorship bias, as only the best from past decades endure in public memory and have votes to show for it. Meanwhile, today’s releases are pushed into a saturated market, fighting for attention and cultural relevance. In this environment, even strong films may struggle to stand out, and average ones fade quickly into obscurity.
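One way to put a number on the "more films, lower ratings" pattern is to correlate each year's movie count with its average rating. The sketch below does this in plain Python on made-up `(year, rating)` pairs. The helper names `yearly_summary` and `pearson` are illustrative and not part of the original analysis, and a strong correlation here would still say nothing about cause.

```python
from collections import defaultdict
from math import sqrt


def yearly_summary(rows):
    """Return {year: (movie_count, mean_rating)} from (year, rating) pairs."""
    by_year = defaultdict(list)
    for year, rating in rows:
        by_year[year].append(rating)
    return {y: (len(r), sum(r) / len(r)) for y, r in sorted(by_year.items())}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up data: more releases per year, lower average rating
toy_rows = [
    (1990, 7.5),
    (2000, 7.0), (2000, 7.2),
    (2010, 6.5), (2010, 6.9), (2010, 6.7),
    (2020, 6.0), (2020, 6.4), (2020, 6.2), (2020, 6.6),
]
summary = yearly_summary(toy_rows)
counts = [count for count, _ in summary.values()]
means = [mean for _, mean in summary.values()]
r = pearson(counts, means)  # close to -1 for this toy data
```

On the real dataframe, the same two series could be taken from the per-year counts and per-year mean ratings already computed for the charts above.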
Movie Ratings Increase with Number of Votes
While ratings offer a snapshot of how a film is received, they don’t exist in a vacuum. Behind every number is an audience—watching, reacting, and choosing whether or not to engage. To better understand what drives these ratings and how they gain traction, it’s helpful to look not just at the scores themselves, but at the level of audience participation behind them.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Define asymptotic model (L and k are set by hand here, not estimated from the data)
def asymptotic_model(x, L=9, k=0.00001):
    return L - np.exp(-k * x)

# Data
votes = df["Votes"]
ratings = df["Rating"]

# Asymptotic curve
x_vals = np.linspace(votes.min(), votes.max(), 1000)
y_asymptote = asymptotic_model(x_vals)

# Main plot
plt.figure(figsize=(10, 6))
sns.set(style="whitegrid")
main_ax = plt.gca()
main_ax.set_facecolor("black")
plt.gcf().patch.set_facecolor("black")

# All data
sns.scatterplot(x=votes, y=ratings, alpha=0.6, color="#E6B91E", s=30, ax=main_ax, label="All Movies")
main_ax.plot(x_vals, y_asymptote, color="green", linestyle="--", linewidth=2, label="Asymptotic Fit (→ 9)")

# Main plot styling
main_ax.set_xlim(0, votes.max())
main_ax.set_ylim(0, 10)
main_ax.set_xlabel("Number of Votes (mil)", fontsize=14, color="white")
main_ax.set_ylabel("Rating", fontsize=14, color="white")
main_ax.set_title("IMDb Rating vs. Number of Votes", fontsize=20, color="gold", weight="bold")
main_ax.tick_params(colors='white', labelsize=12)
for spine in main_ax.spines.values():
    spine.set_edgecolor('white')
main_ax.grid(True, linestyle='--', alpha=0.5, color='white')
main_ax.legend(facecolor="black", edgecolor="white", fontsize=12, labelcolor='white')

# Layout adjustment (no tight_layout)
plt.subplots_adjust(top=0.92, bottom=0.1, left=0.1, right=0.95)
plt.show()
The figure above illustrates a clear and expected trend: movies with higher ratings tend to receive a significantly greater number of votes. This positive, logarithmic 5 relationship (which levels off around 9.0, as no film has ever reached a perfect 10) suggests that top-rated films don’t just flourish on release; they also inspire wider attention and engagement. Acclaimed movies often benefit from word-of-mouth, rewatchability, and cultural impact, all of which contribute to the snowball effect of visibility and votes.
In other words, ratings and votes often reinforce one another: the better a movie is perceived, the more people are inclined to watch and rate it.
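To check a logarithmic relationship like this numerically, one option is an ordinary least-squares fit of rating against log10(votes). The sketch below is an assumption-laden illustration: `fit_log_trend` is a hypothetical helper, the data points are made up, and this straight-line-in-log-space form is a swapped-in alternative to the hand-set curve drawn in the plot.

```python
from math import log10


def fit_log_trend(votes, ratings):
    """Least-squares fit of rating = a + b * log10(votes); returns (a, b)."""
    xs = [log10(v) for v in votes]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ratings) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ratings)) / sum(
        (x - mx) ** 2 for x in xs
    )
    a = my - b * mx
    return a, b


# Made-up points lying exactly on rating = 4.0 + 0.5 * log10(votes)
toy_votes = [10_000, 100_000, 1_000_000]
toy_ratings = [6.0, 6.5, 7.0]
a, b = fit_log_trend(toy_votes, toy_ratings)
```

Under this form, `b` reads as the change in rating associated with a tenfold increase in votes, which is easier to interpret than a raw per-vote slope.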
Genre Trends: Thriving or Dying?
import plotly.graph_objects as go
import pandas as pd

# Filter out "Film-Noir"
filtered_df = df[df['Cleaned_Primary_Genre'] != 'Film-Noir'].copy()

# Clean and bin the years
filtered_df['Year'] = pd.to_numeric(filtered_df['Year'], errors='coerce')
filtered_df = filtered_df.dropna(subset=['Year'])
filtered_df['Year'] = filtered_df['Year'].astype(int)

# Define 25-year periods
bins = [1950, 1975, 2000, 2025]
labels = ['1950–1975', '1975–2000', '2000–2025']
filtered_df['Year_Range'] = pd.cut(filtered_df['Year'], bins=bins, labels=labels)

# Loop through each 25-year period
for label in labels:
    subset = filtered_df[filtered_df['Year_Range'] == label]
    genre_stats = subset.groupby('Cleaned_Primary_Genre').agg(
        avg_rating=('Rating', 'mean'),
        movie_count=('Rating', 'count')
    ).reset_index()

    # Keep genres with >10 movies and sort by avg_rating
    genre_stats = genre_stats[genre_stats['movie_count'] > 10]
    genre_stats = genre_stats.sort_values('avg_rating', ascending=True)

    # Create static bar chart
    fig = go.Figure(go.Bar(
        x=genre_stats['avg_rating'],
        y=genre_stats['Cleaned_Primary_Genre'],
        orientation='h',
        marker_color='#E6B91E',
        text=[f"{count} movies" for count in genre_stats['movie_count']],
        textposition="outside",
        textfont=dict(color='white', size=13)
    ))

    fig.update_layout(
        plot_bgcolor="black",
        paper_bgcolor="black",
        title=dict(
            text=f"⭐ Average IMDb Rating by Genre ({label})",
            font=dict(size=24, color="gold"),
            x=0.5
        ),
        xaxis=dict(
            title="Average Rating",
            range=[5, 9],
            tickfont=dict(color='white', size=14),
            title_font=dict(color='white', size=16),
            showline=True,
            linecolor='white',
            showgrid=False
        ),
        yaxis=dict(
            title="Genre",
            tickfont=dict(color='white', size=14),
            title_font=dict(color='white', size=16),
            linecolor='white'
        )
    )

    # Figures created inside a loop are not displayed automatically, so show each one
    fig.show()
Over the past 75 years, the role of genre in shaping movie ratings has evolved significantly. By breaking the data into three 25-year periods, we can observe how genre trends have shifted—both in popularity and in how strongly they correlate with critical reception.
1950–1975
Biography films led in average ratings.
Horror consistently struggled.
Drama and Comedy were the most commonly produced, but ratings were evenly spread, suggesting that genre didn’t play a huge role in how well a movie was received.
1975–2000
Documentaries emerged as top performers, signaling a shift toward more grounded, real-life storytelling.
New genres like Animation, Family, and Sci-Fi started showing up more often.
Comedy continued to dominate in volume but trailed in ratings, indicating that it’s hard to please everyone with humor.
2000–2025
Documentary and Biography remain the most highly rated.
Comedy and Horror remain oversaturated genres in which it is difficult to stand out.
What started as a relatively flat landscape has sharpened into a clear genre hierarchy. In earlier eras, genre choice had little bearing on a movie’s success. But in today’s saturated market, the stakes are higher. Genres like Documentary and Biography consistently outperform their size, while others struggle to maintain quality at scale. For modern filmmakers, this suggests that genre is no longer just a creative choice—it’s a strategic one.
Act II: What Makes a Highly Rated Movie?
Building a Model to Predict IMDb Rating
This brings us to the heart of the analysis: modeling6.
This model is not designed to decode the perfect formula for producing a high-rated movie, as doing so would strip the art of its soul. Instead, its purpose is to explore the patterns behind audience perception, helping us understand how films resonate across time. Just as AI assists in medicine, finance, and countless other fields, here it serves not as a creator, but as an observer—one that can uncover insights without dictating creativity.
To evaluate model performance, we used two standard scoring7 metrics:
from IPython.display import HTML

display(HTML('''
<table style="width:100%; border-collapse: collapse; margin: 2em 0; font-size:15px;">
  <thead style="background-color:#f9f9f9;">
    <tr>
      <th style="padding:10px; border: 1px solid #999;">Metric</th>
      <th style="padding:10px; border: 1px solid #999;">What It Means</th>
      <th style="padding:10px; border: 1px solid #999;">Why It Matters</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="padding:10px; border: 1px solid #999;"><strong>RMSE</strong> (Root Mean Squared Error)</td>
      <td style="padding:10px; border: 1px solid #999;">Shows how far off the model’s predictions are from the actual movie ratings, on average.</td>
      <td style="padding:10px; border: 1px solid #999;">An RMSE of 0.89 means the model is usually within 1 star of the actual IMDb rating (on a 10-star scale).</td>
    </tr>
    <tr>
      <td style="padding:10px; border: 1px solid #999;"><strong>R² Score</strong> (Coefficient of Determination)</td>
      <td style="padding:10px; border: 1px solid #999;">Tells us how well the model explains the differences in movie ratings.</td>
      <td style="padding:10px; border: 1px solid #999;">A higher R² means the model does a better job at understanding what drives audience scores.</td>
    </tr>
  </tbody>
</table>
'''))
| Metric | What It Means | Why It Matters |
|---|---|---|
| RMSE (Root Mean Squared Error) | Shows how far off the model’s predictions are from the actual movie ratings, on average. | An RMSE of 0.89 means the model is usually within 1 star of the actual IMDb rating (on a 10-star scale). |
| R² Score (Coefficient of Determination) | Tells us how well the model explains the differences in movie ratings. | A higher R² means the model does a better job at understanding what drives audience scores. |
Together, these metrics provide a lens for assessing how well the model understands audience sentiment. Not perfectly, but well enough to reveal patterns beneath the noise.
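For readers who want to see what the two metrics compute, here is a minimal sketch using their standard definitions in plain Python. The `actual` and `predicted` values are made up for illustration and are not model output from this project.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared difference."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up ratings: every prediction is off by exactly half a star
actual = [6.0, 7.0, 8.0, 9.0]
predicted = [6.5, 6.5, 8.5, 8.5]

error = rmse(actual, predicted)       # 0.5
fit = r_squared(actual, predicted)    # 0.8
```

RMSE is expressed in rating points, while R² compares the model's squared error with that of always predicting the mean rating, so the two answer different questions about the same predictions.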
Features used: Certificate, Word_Count_Binned, Director, Cleaned_Primary_Genre, Director_Is_Writer, Duration, Num_Genres, Year
Random Forest RMSE: 0.87
Random Forest R² Score: 0.27
Cross-Validation Performance (5-Fold)
Average R² Score: 0.289
R² Scores per Fold: 0.272, 0.315, 0.302, 0.294, 0.263
Predicting movie ratings isn’t just a technical challenge; it’s a human one. Preferences are personal, emotional, and unpredictable. Even Netflix’s famed $1 million competition only improved RMSE by 0.09 points over the company’s baseline, reinforcing a natural ceiling for accuracy. This project’s 0.89 RMSE lands within that same range, despite a far simpler modeling stack. It speaks to a key truth: complexity doesn’t always win, especially when interpretability matters.
While XGBoost is often favored for squeezing out marginal gains, Random Forest edged it out here, with a lower RMSE (0.87) and a slightly higher R² (0.289 vs. 0.28). With values above 0.25 often considered strong in applied contexts, Random Forest offers the best trade-off between performance and interpretability, making it the most practical choice here.
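The 5-fold scores reported above come from repeatedly training on four fifths of the data and scoring on the remaining fifth. The sketch below shows that mechanic in plain Python. It is not the project's pipeline: a one-feature least-squares line stands in for the Random Forest, the data are made up, and all function names are hypothetical. In practice the rows would normally be shuffled before splitting.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_r2(xs, ys, k, fit, predict):
    """Return one R² per fold: train on k-1 folds, score on the held-out fold."""
    scores = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        preds = [predict(model, xs[i]) for i in test_idx]
        truth = [ys[i] for i in test_idx]
        mean_t = sum(truth) / len(truth)
        ss_res = sum((t - p) ** 2 for t, p in zip(truth, preds))
        ss_tot = sum((t - mean_t) ** 2 for t in truth)
        scores.append(1 - ss_res / ss_tot)
    return scores


def fit_line(xs, ys):
    """Stand-in model: least-squares line, returned as (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


def predict_line(model, x):
    slope, intercept = model
    return slope * x + intercept


# Made-up, perfectly linear data, so every fold scores an R² of about 1
toy_x = list(range(10))
toy_y = [2 * x + 1 for x in toy_x]
fold_scores = cross_val_r2(toy_x, toy_y, k=5, fit=fit_line, predict=predict_line)
average_r2 = sum(fold_scores) / len(fold_scores)
```

Reporting the per-fold scores alongside the average, as the results above do, shows how much the score moves with the particular split.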
Feature Selection & Importance
In machine learning, a black box model refers to an algorithm that can make highly accurate predictions, but doesn’t easily reveal how it arrives at those predictions. Unlike simpler models where each variable’s effect is clearly laid out, black box models like Random Forest and XGBoost rely on layered decision rules that are often hidden from direct view.
But “black box” doesn’t have to mean unknowable. With the right tools, we can begin to unpack the inner logic of these models and pull meaning from complexity. Instead of diving into dense technical plots (see Appendix), we’ve highlighted a few of the most influential factors that shaped predictions, focusing on patterns around runtime and creative control.
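One common tool for opening up a black box is permutation importance: shuffle one feature at a time and measure how much the model's error grows. The text does not say which tool was used, so the sketch below is only an illustration of that general idea in plain Python. `toy_predict` is a hypothetical stand-in model and the rows are made up.

```python
import random
from math import sqrt


def rmse(actual, predicted):
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def permutation_importance(predict, rows, targets, n_features, seed=0):
    """Increase in RMSE after shuffling each feature column, one at a time."""
    rng = random.Random(seed)
    baseline = rmse(targets, [predict(r) for r in rows])
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the target
        shuffled_rows = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
        score = rmse(targets, [predict(r) for r in shuffled_rows])
        importances.append(score - baseline)
    return importances


# Hypothetical stand-in model: uses feature 0 (runtime), ignores feature 1
def toy_predict(row):
    return 5.0 + 0.02 * row[0]


# Made-up rows of (runtime in minutes, director-is-writer flag)
toy_rows = [(90, 1), (100, 0), (120, 1), (150, 0), (180, 1), (200, 0)]
toy_targets = [toy_predict(r) for r in toy_rows]
importances = permutation_importance(toy_predict, toy_rows, toy_targets, n_features=2)
```

Because the stand-in model ignores the second feature, shuffling it changes nothing and its importance is zero, while shuffling runtime can only make the error worse or leave it unchanged.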
import numpy as np

# Select the runtimes of "Top Rated" movies (IMDb rating >= 7.0)
top_rated_movies = df[df["Highly_Rated"] == 1]["Duration"].dropna()

# Central coverage levels to calculate (percent of top-rated movies inside each range)
confidence_intervals = [75, 90, 99]

# Dictionary to store results
duration_ranges = {}

for confidence in confidence_intervals:
    # Compute lower and upper percentiles dynamically based on the coverage level
    lower_percentile = (100 - confidence) / 2  # e.g., for 75%, this gives 12.5%
    upper_percentile = 100 - lower_percentile  # e.g., for 75%, this gives 87.5%
    lower_bound, upper_bound = np.percentile(top_rated_movies, [lower_percentile, upper_percentile])

    # Convert minutes to hours and minutes
    lower_hours, lower_minutes = divmod(int(lower_bound), 60)
    upper_hours, upper_minutes = divmod(int(upper_bound), 60)

    # Store result
    duration_ranges[confidence] = (f"{lower_hours}h {lower_minutes}m", f"{upper_hours}h {upper_minutes}m")

# 75% of top-rated movies are between 1h 32m and 2h 22m
# 90% of top-rated movies are between 1h 24m and 2h 42m
# 99% of top-rated movies are between 1h 5m and 3h 21m
While most top-rated movies fall within the typical runtime sweet spot, a few of the highest-rated films, like The Godfather and The Lord of the Rings, break this mold with runtimes of around 3 hours or more. These exceptions show that while longer films can be the best of the best, they tend to be rare and must truly earn their length.
Act III: Show Me the Money (data)
Sometimes, descriptive analytics8 are just as telling and valuable to an organization as a well-built predictive model. So when we have both, why not use them? In a landscape where it’s difficult to gather accurate, representative data on the film industry due to the sheer volume of creative works being released every day, these types of insights can be just as impactful.
Guessing who the most popular actors and directors are based on public perception is useful in itself, but being able to attach a measurable value to that perception is even more powerful. The creatives below are consistently associated with higher-rated feature films, and their average rating across all the works in this dataset that they’ve been part of supports that claim.
import pandas as pd
from IPython.display import HTML

# Ensure 'Rating' is numeric
df["Rating"] = pd.to_numeric(df["Rating"], errors="coerce")

# --- 1️⃣ Actors with Highest Average IMDb Rating (Min 10 Movies) ---
df_actors = df.assign(Actor=df["Stars"].str.split(", ")).explode("Actor")
actor_stats = df_actors.groupby("Actor").agg(
    avg_rating=("Rating", "mean"),
    num_movies=("Actor", "count")
).reset_index()

# Filter, round, and sort
top_actors = actor_stats[actor_stats["num_movies"] >= 10].copy()
top_actors["avg_rating"] = top_actors["avg_rating"].round(2)
top_actors = top_actors.sort_values("avg_rating", ascending=False)

top_actors_display = top_actors.head(20).rename(columns={
    "Actor": "Actor",
    "avg_rating": "Avg. Rating",
    "num_movies": "Number of Movies (min 10)"
})

# Add a ranking column from 1 to 20
top_actors_display = top_actors_display.reset_index(drop=True)
top_actors_display.index += 1
top_actors_display.index.name = "Rank"
top_actors_display_md = top_actors_display.reset_index()

# Display as HTML table
display(HTML('<h3 style="text-align: center;">🎭 Top 20 Actors by Average Rating</h3>'))
display(HTML(top_actors_display_md.to_html(index=False, classes="table", border=0)))
🎭 Top 20 Actors by Average Rating
| Rank | Actor | Avg. Rating | Number of Movies (min 10) |
|---|---|---|---|
| 1 | Humphrey Bogart | 7.70 | 11 |
| 2 | James Stewart | 7.64 | 12 |
| 3 | Leonardo DiCaprio | 7.58 | 20 |
| 4 | Orson Welles | 7.55 | 10 |
| 5 | Henry Fonda | 7.54 | 11 |
| 6 | Ingrid Bergman | 7.50 | 10 |
| 7 | Ian McKellen | 7.50 | 13 |
| 8 | Tony Leung Chiu-wai | 7.46 | 11 |
| 9 | Song Kang-ho | 7.45 | 10 |
| 10 | Cary Grant | 7.37 | 17 |
| 11 | Joan Allen | 7.31 | 10 |
| 12 | Laurence Olivier | 7.28 | 10 |
| 13 | Brad Pitt | 7.28 | 30 |
| 14 | Patrick Stewart | 7.27 | 11 |
| 15 | James Mason | 7.27 | 10 |
| 16 | Anthony Quinn | 7.25 | 11 |
| 17 | Kirk Douglas | 7.22 | 12 |
| 18 | Chris Cooper | 7.21 | 10 |
| 19 | Marlon Brando | 7.21 | 19 |
| 20 | Irrfan Khan | 7.19 | 12 |
# --- 2️⃣ Directors with Highest Average IMDb Rating (Min 10 Movies) ---
df_directors = df.groupby("Director").agg(
    avg_rating=("Rating", "mean"),
    num_movies=("Director", "count")
).reset_index()

top_directors = df_directors[df_directors["num_movies"] >= 10].sort_values("avg_rating", ascending=False)

top_directors_display = top_directors.head(20).rename(columns={
    "Director": "Director",
    "avg_rating": "Avg. Rating",
    "num_movies": "Number of Movies (min 10)"
})

# Round "Avg. Rating" to 2 decimal places
top_directors_display["Avg. Rating"] = top_directors_display["Avg. Rating"].round(2)

# Reset index and add rank
top_directors_display = top_directors_display.reset_index(drop=True)
top_directors_display.index += 1
top_directors_display.index.name = "Rank"
top_directors_display_md = top_directors_display.reset_index()

# Display as HTML table
display(HTML('<h3 style="text-align: center;">🎬 Top 20 Directors by Average Rating</h3>'))
display(HTML(top_directors_display_md.to_html(index=False, classes="table", border=0)))
🎬 Top 20 Directors by Average Rating
| Rank | Director | Avg. Rating | Number of Movies (min 10) |
|---|---|---|---|
| 1 | Christopher Nolan | 8.15 | 11 |
| 2 | Hayao Miyazaki | 7.97 | 11 |
| 3 | Stanley Kubrick | 7.87 | 10 |
| 4 | Billy Wilder | 7.74 | 10 |
| 5 | Peter Jackson | 7.68 | 12 |
| 6 | David Fincher | 7.66 | 11 |
| 7 | Martin Scorsese | 7.56 | 24 |
| 8 | Alfred Hitchcock | 7.53 | 21 |
| 9 | Joel Coen | 7.50 | 10 |
| 10 | Howard Hawks | 7.43 | 12 |
| 11 | Steven Spielberg | 7.40 | 31 |
| 12 | Yimou Zhang | 7.40 | 12 |
| 13 | Guy Ritchie | 7.32 | 11 |
| 14 | James Mangold | 7.27 | 11 |
| 15 | Alan Parker | 7.25 | 11 |
| 16 | Kevin Macdonald | 7.22 | 10 |
| 17 | Ken Loach | 7.21 | 11 |
| 18 | Robert Zemeckis | 7.18 | 16 |
| 19 | John Huston | 7.16 | 12 |
| 20 | Roman Polanski | 7.16 | 14 |
# --- 3️⃣ Actors Who Appeared in the Most Movies ---
most_movies_actors = actor_stats.copy()
most_movies_actors["avg_rating"] = most_movies_actors["avg_rating"].round(2)
most_movies_actors = most_movies_actors.sort_values("num_movies", ascending=False)

most_movies_display = most_movies_actors.head(20).rename(columns={
    "Actor": "Actor",
    "avg_rating": "Avg. Rating",
    "num_movies": "Number of Movies (min 10)"
})

# Round "Avg. Rating" to 2 decimal places
most_movies_display["Avg. Rating"] = most_movies_display["Avg. Rating"].round(2)

# Reset index and add rank
most_movies_display = most_movies_display.reset_index(drop=True)
most_movies_display.index += 1
most_movies_display.index.name = "Rank"
most_movies_display_md = most_movies_display.reset_index()

# Display as HTML with rank
display(HTML('<h3 style="text-align: center;">⭐️ Most Active Actors</h3>'))
display(HTML(most_movies_display_md.to_html(index=False, classes="table", border=0)))
⭐️ Most Active Actors
| Rank | Actor | Avg. Rating | Number of Movies (min 10) |
|---|---|---|---|
| 1 | Robert De Niro | 6.73 | 63 |
| 2 | Samuel L. Jackson | 6.45 | 60 |
| 3 | Nicolas Cage | 6.07 | 54 |
| 4 | Bruce Willis | 6.15 | 53 |
| 5 | Liam Neeson | 6.43 | 47 |
| 6 | Johnny Depp | 6.83 | 45 |
| 7 | Dennis Quaid | 6.43 | 45 |
| 8 | Nicole Kidman | 6.46 | 43 |
| 9 | Morgan Freeman | 6.53 | 42 |
| 10 | Denzel Washington | 6.94 | 41 |
| 11 | Tom Hanks | 7.14 | 41 |
| 12 | Anthony Hopkins | 6.65 | 40 |
| 13 | Woody Harrelson | 6.53 | 39 |
| 14 | Sylvester Stallone | 6.07 | 38 |
| 15 | Matt Damon | 7.02 | 38 |
| 16 | Gene Hackman | 6.74 | 38 |
| 17 | Ethan Hawke | 6.66 | 37 |
| 18 | Jeff Bridges | 6.73 | 37 |
| 19 | Clint Eastwood | 6.94 | 37 |
| 20 | Ewan McGregor | 6.58 | 36 |
# --- 4️⃣ Directors Who Have Directed the Most Movies ---
most_movies_directors = df_directors.sort_values("num_movies", ascending=False)

most_directors_display = most_movies_directors.head(20).rename(columns={
    "Director": "Director",
    "avg_rating": "Avg. Rating",
    "num_movies": "Number of Movies (min 10)"
})

# Round "Avg. Rating" to 2 decimal places
most_directors_display["Avg. Rating"] = most_directors_display["Avg. Rating"].round(2)

# Add a new 1-based index for display
most_directors_display = most_directors_display.reset_index(drop=True)
most_directors_display.index += 1
most_directors_display.index.name = "Rank"
most_directors_display_md = most_directors_display.reset_index()

# Display as HTML with rank
display(HTML('<h3 style="text-align: center;">🎥 Most Active Directors</h3>'))
display(HTML(most_directors_display_md.to_html(index=False, classes="table", border=0)))
🎥 Most Active Directors
| Rank | Director | Avg. Rating | Number of Movies (min 10) |
|------|----------|-------------|---------------------------|
| 1 | Clint Eastwood | 6.92 | 33 |
| 2 | Steven Spielberg | 7.40 | 31 |
| 3 | Ridley Scott | 6.99 | 26 |
| 4 | Martin Scorsese | 7.56 | 24 |
| 5 | Steven Soderbergh | 6.81 | 22 |
| 6 | Alfred Hitchcock | 7.53 | 21 |
| 7 | Woody Allen | 7.00 | 21 |
| 8 | Brian De Palma | 6.73 | 20 |
| 9 | Tim Burton | 6.99 | 20 |
| 10 | Ron Howard | 7.02 | 20 |
| 11 | Sidney Lumet | 6.88 | 19 |
| 12 | Francis Ford Coppola | 7.12 | 19 |
| 13 | Walter Hill | 6.47 | 19 |
| 14 | Barry Levinson | 6.56 | 18 |
| 15 | Oliver Stone | 6.83 | 18 |
| 16 | Renny Harlin | 5.58 | 17 |
| 17 | Joel Schumacher | 6.39 | 17 |
| 18 | Blake Edwards | 6.52 | 16 |
| 19 | Robert Zemeckis | 7.18 | 16 |
| 20 | Tony Scott | 6.81 | 16 |
from IPython.display import Markdown, display

# One row per (movie, actor) pair
df["Actor_List"] = df["Stars"].apply(
    lambda x: [actor.strip() for actor in x.split(",") if actor.strip()]
)
df_collab = df.explode("Actor_List")
df_collab["Actor_Director_Collab"] = df_collab["Actor_List"] + " & " + df_collab["Director"]

# Aggregate per actor-director pair
collab_stats = (
    df_collab.groupby("Actor_Director_Collab")
    .agg(
        Num_Collaborations=("Title", "count"),
        Avg_IMDB_Rating=("Rating", "mean"),
        Percent_Highly_Rated_Movies=("Highly_Rated", "mean")
    )
    .reset_index()
)

# Keep pairs with at least 3 films together
collab_stats = collab_stats[collab_stats["Num_Collaborations"] >= 3].copy()
collab_stats["Avg_IMDB_Rating"] = collab_stats["Avg_IMDB_Rating"].round(2)
collab_stats["Percent_Highly_Rated_Movies"] = (collab_stats["Percent_Highly_Rated_Movies"] * 100).round(2)

top_collabs = collab_stats.sort_values(by="Avg_IMDB_Rating", ascending=False).head(20).reset_index(drop=True)
top_collabs.index += 1
top_collabs = top_collabs.reset_index().rename(columns={
    "index": "Rank",
    "Actor_Director_Collab": "Actor–Director Duo",
    "Num_Collaborations": "Number of Collaborations",
    "Avg_IMDB_Rating": "Avg. IMDb Rating",
    "Percent_Highly_Rated_Movies": "% Highly Rated Movies"
})

# Display as HTML with rank
display(HTML('<h3 style="text-align: center;">☯️ Best-Rated Actor–Director Duos</h3>'))
display(HTML(top_collabs.to_html(index=False, classes="table", border=0)))
☯️ Best-Rated Actor–Director Duos
| Rank | Actor–Director Duo | Number of Collaborations | Avg. IMDb Rating | % Highly Rated Movies |
|------|--------------------|--------------------------|------------------|-----------------------|
| 1 | Marlon Brando & Francis Ford Coppola | 3 | 8.93 | 100.00 |
| 2 | Elijah Wood & Peter Jackson | 3 | 8.90 | 100.00 |
| 3 | Al Pacino & Francis Ford Coppola | 4 | 8.75 | 100.00 |
| 4 | Tarik Akan & Ertem Egilmez | 3 | 8.53 | 100.00 |
| 5 | Christian Bale & Christopher Nolan | 4 | 8.52 | 100.00 |
| 6 | Uma Thurman & Quentin Tarantino | 4 | 8.45 | 100.00 |
| 7 | Brad Pitt & David Fincher | 3 | 8.40 | 100.00 |
| 8 | Ian McKellen & Peter Jackson | 5 | 8.38 | 100.00 |
| 9 | Clint Eastwood & Sergio Leone | 3 | 8.30 | 100.00 |
| 10 | James Caan & Francis Ford Coppola | 3 | 8.23 | 66.67 |
| 11 | Joe Pesci & Martin Scorsese | 4 | 8.20 | 100.00 |
| 12 | Robert Downey Jr. & Anthony Russo | 3 | 8.20 | 100.00 |
| 13 | Tatsuya Nakadai & Akira Kurosawa | 3 | 8.17 | 100.00 |
| 14 | Samuel L. Jackson & Quentin Tarantino | 3 | 8.07 | 100.00 |
| 15 | Toshirô Mifune & Akira Kurosawa | 4 | 8.07 | 100.00 |
| 16 | Grace Kelly & Alfred Hitchcock | 3 | 8.03 | 100.00 |
| 17 | James Stewart & Alfred Hitchcock | 4 | 8.02 | 100.00 |
| 18 | Arnold Schwarzenegger & James Cameron | 3 | 8.00 | 100.00 |
| 19 | Michael Biehn & James Cameron | 3 | 8.00 | 100.00 |
| 20 | Leonardo DiCaprio & Martin Scorsese | 5 | 7.98 | 100.00 |
Comparing Originals to Their Successors
import pandas as pd
import re
from rapidfuzz import process, fuzz
from collections import defaultdict

# Step 1: Clean title (lowercase, strip punctuation, collapse whitespace)
def clean_title(title):
    title = title.lower()
    title = re.sub(r'[^\w\s]', '', title)
    return re.sub(r'\s+', ' ', title).strip()

df_titles = df[["Title", "Rating", "Year"]].copy()
df_titles["clean_title"] = df_titles["Title"].apply(clean_title)

# Step 2: Extract first strong word (excluding filler words)
filler_words = {"the", "a", "an", "of", "and", "in", "on", "at", "to", "for"}

def get_main_word(title):
    words = title.split()
    for word in words:
        if word not in filler_words:
            return word
    return words[0] if words else "unknown"

df_titles["block_key"] = df_titles["clean_title"].apply(get_main_word)

# Step 3: Group by block_key and fuzzy-match titles within each block
df_titles["cluster_id"] = -1
matched = set()
clusters = defaultdict(list)
cluster_id = 0
for key, group in df_titles.groupby("block_key"):
    if len(group) == 1:
        # Singleton block: give it its own cluster id and move on
        idx = group.index[0]
        df_titles.at[idx, "cluster_id"] = cluster_id
        cluster_id += 1
        continue
    titles = group["clean_title"].tolist()
    # Look rows up by position rather than by title text, so that films
    # sharing an identical cleaned title (e.g. same-name remakes) are not
    # collapsed into a single dictionary entry.
    row_ids = group.index.tolist()
    for pos, title in enumerate(titles):
        if row_ids[pos] in matched:
            continue
        # limit=None returns every match above the cutoff (the default is 5)
        results = process.extract(
            title, titles, scorer=fuzz.token_set_ratio, score_cutoff=70, limit=None
        )
        for _, _, match_pos in results:
            idx = row_ids[match_pos]
            if idx not in matched:
                matched.add(idx)
                clusters[cluster_id].append(idx)
        cluster_id += 1

# Step 4: Assign cluster ids back to the DataFrame
for cid, indices in clusters.items():
    df_titles.loc[indices, "cluster_id"] = cid
df_titles = df_titles.sort_values(["cluster_id", "Year"])

# Filter to only clusters with more than 1 member
non_singleton_clusters = df_titles["cluster_id"].value_counts()
non_singleton_clusters = non_singleton_clusters[non_singleton_clusters > 1].index
df_clustered = df_titles[
    df_titles["cluster_id"].isin(non_singleton_clusters) & (df_titles["cluster_id"] != -1)
].sort_values(["cluster_id", "Year"])
We wanted to see how the original movie of a series compares to its “follower,” for lack of a better term, whether that follower is a remake or a sequel (for example, Top Gun and Top Gun: Maverick, or Cars and Cars 2).
To do this, we first used fuzzy matching⁹ to cluster together movies with similar titles. After running the algorithm, we manually checked each grouping to confirm that its films belonged to the same franchise or storyline. This comparison also accounts for situations where, due to scraping limitations, only the second through fourth movies in a series were captured. In those cases, the earliest available film was treated as the “original.”
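The clustering idea can be illustrated on a handful of titles. The sketch below is a simplified, standard-library stand-in and not the pipeline used for the report: `token_overlap` is a hypothetical scorer that only approximates what a token-set similarity does, and the five titles are chosen purely for illustration.

```python
import re

FILLER_WORDS = {"the", "a", "an", "of", "and", "in", "on", "at", "to", "for"}

def clean_title(title):
    """Lowercase, drop punctuation, collapse whitespace."""
    title = re.sub(r"[^\w\s]", "", title.lower())
    return re.sub(r"\s+", " ", title).strip()

def block_key(cleaned):
    """First non-filler word; titles are only compared within a block."""
    words = cleaned.split()
    for word in words:
        if word not in FILLER_WORDS:
            return word
    return words[0] if words else "unknown"

def token_overlap(a, b):
    """Hypothetical scorer: share of the shorter title's words that
    also appear in the other title, scaled to 0-100."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a or not tokens_b:
        return 0.0
    return 100 * len(tokens_a & tokens_b) / min(len(tokens_a), len(tokens_b))

def cluster_titles(titles, cutoff=70):
    """Greedy clustering: each unassigned title seeds a cluster and pulls in
    every unassigned title from its block that scores at least `cutoff`."""
    cleaned = [clean_title(t) for t in titles]
    blocks = {}
    for pos, text in enumerate(cleaned):
        blocks.setdefault(block_key(text), []).append(pos)
    clusters, assigned = [], set()
    for positions in blocks.values():
        for seed in positions:
            if seed in assigned:
                continue
            members = [p for p in positions
                       if p not in assigned
                       and token_overlap(cleaned[seed], cleaned[p]) >= cutoff]
            assigned.update(members)
            clusters.append([titles[p] for p in members])
    return clusters

titles = ["Cars", "Top Gun", "Cars 2", "Top Gun: Maverick", "The Car"]
clusters = cluster_titles(titles)
for group in clusters:
    print(group)
```

Blocking on the first meaningful word keeps the number of pairwise comparisons small, but it also means a follow-up whose title starts with a different word would never be compared to its original, which is one reason a manual review of the groupings is still needed.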
Simply put, we compared the rating of the earliest film in each cluster¹⁰ with the rating of every later film in that cluster, which gives a larger sample of comparisons than pairing each original with a single follow-up. The results were striking but intuitive.
The comparison was not just anecdotal: originals consistently outperformed their sequels, remakes, and follow-ups, and a two-sample t-test indicated that the difference in ratings is unlikely to be due to chance alone.
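For readers unfamiliar with the test, here is a minimal sketch of how a two-sample t statistic is computed. It assumes the unequal-variance (Welch) form, which may differ from the variant used in the report, and the ratings are made-up values for illustration only, not figures from our dataset.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for
    Welch's unequal-variance two-sample t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # statistics.variance is the sample variance (divides by n - 1)
    se_a = variance(sample_a) / n_a
    se_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom
    dof = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t_stat, dof

# Illustrative ratings only (hypothetical, not taken from the report)
original_ratings = [7.8, 7.2, 8.1, 6.9, 7.5]
follower_ratings = [6.9, 6.5, 7.4, 6.1, 7.0]

t_stat, dof = welch_t(original_ratings, follower_ratings)
print(f"t = {t_stat:.2f}, degrees of freedom = {dof:.1f}")
```

A large positive t statistic means the originals' mean rating sits well above the followers' mean relative to the spread in both groups. Turning it into a p-value requires the t distribution, which a statistics library such as SciPy provides.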
import pandas as pd
from IPython.display import Markdown, display

# Start fresh with a copy of your existing DataFrame
df_clustered = df_series_remakes.copy()

# Filter out unclustered (-1)
df_clustered = df_clustered[df_clustered['cluster_id'] != -1]

# Sort by cluster and year
df_clustered = df_clustered.sort_values(by=['cluster_id', 'Year'])

# Label roles: first = original, rest = follower
def assign_roles(group):
    group = group.sort_values(by='Year')
    group['role_new'] = ['original'] + ['follower'] * (len(group) - 1)
    return group

df_clustered_labeled = df_clustered.groupby('cluster_id', group_keys=False).apply(assign_roles)

# Compare followers to original
comparison_rows_new = []
for cluster_id, group in df_clustered_labeled.groupby('cluster_id'):
    group = group.sort_values(by='Year')
    if len(group) < 2:
        continue
    original = group.iloc[0]
    for _, follower in group.iloc[1:].iterrows():
        comparison_rows_new.append({
            'cluster_id': cluster_id,
            'original_title': original['Title'],
            'original_rating': original['Rating'],
            'follower_title': follower['Title'],
            'follower_rating': follower['Rating'],
            'follower_same_or_better': follower['Rating'] >= original['Rating']
        })

# Create new comparison DataFrame
df_follower_vs_original = pd.DataFrame(comparison_rows_new)

# Summary stats
total_comparisons_new = len(df_follower_vs_original)
better_or_equal_new = df_follower_vs_original['follower_same_or_better'].sum()
percentage_new = (better_or_equal_new / total_comparisons_new) * 100

display(Markdown(f"""
Out of **{total_comparisons_new}** remake comparisons, **{better_or_equal_new}** were rated the same or better than the original.

That’s **{percentage_new:.2f}%** of the time.
"""))
Out of 494 remake comparisons, 89 were rated the same or better than the original.
That’s 18.02% of the time.
This supports a broader trend: while follow-ups may have higher budgets or better effects, they rarely capture the audience approval earned by the original.
Act IV: The Verdict
So, why does any of this matter?
We weren’t just modeling for the sake of prediction — we were trying to understand what really connects with audiences, and why some films seem to resonate while others fade. Here’s what stood out:
Creative Choices that Resonate
Runtime: There’s a consistent sweet spot — the majority of best-rated films tend to run between 1h 32m and 2h 22m.
Director-Writer Control: When the same person directs and writes, ratings are higher — likely a result of more unified vision.
Realism Resonates: Genres like Biography and Documentary outperform others, pointing to a strong audience appetite for real, grounded storytelling.
Followers Rarely Outperform Originals: While sequels and remakes are common, they matched or beat the original’s rating in only about 18% of our comparisons, which suggests a follow-up has to be exceptionally well executed to earn the same approval as the original story.
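Because durations in the dataset are plotted in minutes, it can help to express the runtime sweet spot the same way. The helper names below are hypothetical and only illustrate the conversion; the bounds are the ones quoted above.

```python
def to_minutes(runtime):
    """Convert a string such as '1h 32m' into total minutes."""
    hours, minutes = 0, 0
    for part in runtime.split():
        if part.endswith("h"):
            hours = int(part[:-1])
        elif part.endswith("m"):
            minutes = int(part[:-1])
    return hours * 60 + minutes

# Sweet-spot bounds quoted in the findings, expressed in minutes
SWEET_SPOT = (to_minutes("1h 32m"), to_minutes("2h 22m"))

def in_sweet_spot(duration_min):
    """True when a runtime in minutes falls inside the quoted range."""
    low, high = SWEET_SPOT
    return low <= duration_min <= high

print(SWEET_SPOT)          # (92, 142)
print(in_sweet_spot(120))  # True
```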
Simple is Special - Modeling
We didn’t need budget data to get meaningful results. Using only surface-level features like cast, runtime, genre, and creative roles, we were able to predict IMDb ratings with strong accuracy.
We even tested the models on upcoming releases (see Credits) to see if the results still hold.
The Mystery Remains — But We’re Closer
We opened with the question: Are movies getting worse or are they simply getting missed?
And while there may never be a definitive answer, this report gets us closer.
Streaming has dramatically increased content volume, and with it, audience saturation. Today’s well-made films often disappear quickly, overshadowed by aggressive release schedules and the pressure to feed platforms with new content. Even movies that follow the “successful” formula struggle to break through the noise. Ultimately, the challenge may not be that we’re making worse movies — it’s that we’re creating too many, with too little space for any of them to breathe.
The art is still there. It’s just harder to notice.
Credits: A Real-World Application
# ========================
# 6. Predict for All Movies in File
# ========================

# Load the file (CSV or Excel)
df_predict = pd.read_excel("/Users/rileysvensson/Desktop/CAA - Data Analyst/Projects/imdb_data/final/predict_movies.xlsx")

# Drop unneeded columns
X_new = df_predict.drop(columns=drop_cols, errors="ignore")

# Transform using the pipeline
X_new_trans = ct_all.transform(X_new[categorical_features + numerical_features])
X_new_trans = np.array(X_new_trans, dtype=np.float32)

# Predict and rescale
y_preds_scaled = rf_model.predict(X_new_trans)
y_preds_original = scaler_y.inverse_transform(y_preds_scaled.reshape(-1, 1))

# Attach predictions to original dataframe
df_predict["Predicted_Rating"] = y_preds_original

# Sort by predicted rating (highest to lowest)
df_predict_sorted = df_predict.sort_values("Predicted_Rating", ascending=False)

# Print title and predicted rating for each movie
for title, rating in zip(df_predict_sorted["Title"], df_predict_sorted["Predicted_Rating"]):
    print(f"🎬 {title}: {rating:.2f}")
🎬 Sinners: 6.74
🎬 Until Dawn: 6.60
🎬 Lilo & Stitch: 6.44
🎬 Sneaks: 6.31
🎬 Final Destination: Bloodlines: 6.16
🎬 Magic Farm: 6.10
🎬 A Minecraft Movie: 5.95
🎬 The Wedding Banquet: 5.89
🎬 The Lost Princess: 4.94
🎬 Bears on a Ship: 4.71
Glossary
1. Machine Learning - Algorithms that learn patterns from data to make predictions or decisions without being explicitly programmed.
2. Scraping - The automated process of extracting data from websites or online sources.
3. Data Wrangling - Cleaning, restructuring, and enriching raw data into a usable format.
4. Features - The individual measurable properties (variables) used as inputs to a model.
5. Logarithmic - A growth pattern in which values rise rapidly at first and then increase ever more slowly; used to model relationships with diminishing returns.
6. Modeling - The process of training a machine learning algorithm on data to learn patterns and relationships, with the goal of making predictions or gaining insights.
7. Scoring - The use of evaluation metrics to measure a model’s predictive performance.
8. Descriptive Analytics - Analyzing historical data to understand what has happened.
9. Fuzzy Matching - A technique used to find approximate matches between strings, useful for identifying similar but not identical text.
10. Cluster - A group of related records, typically grouped by a shared ID or characteristic. In this case, movies that belong to the same franchise or series.
Appendix
import seaborn as sns
import matplotlib.pyplot as plt
from tabulate import tabulate

# Get feature names after transformation
feature_names = ct_all.get_feature_names_out()

# Get feature importances
importances = rf_model.feature_importances_

# Create a DataFrame for better readability
feature_importance_df = pd.DataFrame({
    "Feature": feature_names,
    "Importance": importances
}).sort_values(by="Importance", ascending=False)

# Convert top 20 to markdown table
top_20_features = feature_importance_df.head(20)
#
#| Feature                                | Importance   |
#|----------------------------------------|--------------|
#| num__Duration                          | 0.175685     |
#| num__Year                              | 0.117457     |
#| num__Num_Genres                        | 0.0316729    |
#| cat__Cleaned_Primary_Genre_Drama       | 0.0293566    |
#| cat__Cleaned_Primary_Genre_Documentary | 0.0179949    |
#| cat__Cleaned_Primary_Genre_Biography   | 0.0126313    |
#| cat__Cleaned_Primary_Genre_Action      | 0.0086014    |
#| cat__Word_Count_Binned_30+             | 0.00728996   |
#| cat__Certificate_Not Rated             | 0.00728064   |
#| cat__Certificate_R                     | 0.00695221   |
#| cat__Certificate_PG-13                 | 0.006874     |
#| cat__Director_Is_Writer_1              | 0.00664878   |
#| cat__Director_Is_Writer_0              | 0.00639114   |
#| cat__Word_Count_Binned_20-25           | 0.0063814    |
#| cat__Word_Count_Binned_25-30           | 0.00635609   |
#| cat__Cleaned_Primary_Genre_Adult       | 0.00603386   |
#| cat__Word_Count_Binned_15-20           | 0.00577349   |
#| cat__Cleaned_Primary_Genre_Sport       | 0.00572772   |
#| cat__Director_Jon M. Chu               | 0.00543463   |
#| cat__Certificate_PG                    | 0.00534882   |
#

# IMDb theme colors
BACKGROUND = '#000000'
TEXT = '#f5f5f5'
HIGHLIGHT = '#f5c518'

# Set overall seaborn and matplotlib style
sns.set_style("darkgrid")
plt.rcParams.update({
    'axes.facecolor': BACKGROUND,
    'figure.facecolor': BACKGROUND,
    'axes.labelcolor': TEXT,
    'xtick.color': TEXT,
    'ytick.color': TEXT,
    'text.color': TEXT,
    'axes.edgecolor': TEXT,
    'grid.color': '#444444',  # soft gridlines
    'axes.titleweight': 'bold',
    'axes.titlepad': 15,
    'axes.titlesize': 14
})

# 1️⃣ Duration vs IMDb Rating - Sweet spot check
plt.figure(figsize=(8, 5))
sns.scatterplot(x=df['Duration'], y=df['Rating'], alpha=0.4, color='white')  # Data points
sns.lineplot(x='Duration', y='Rating', data=df, estimator='mean', errorbar=None, color=HIGHLIGHT, linewidth=2)
plt.title('IMDb Rating by Duration')
plt.xlabel('Duration (min)')
plt.ylabel('Rating')
plt.grid(True)
plt.tight_layout()
plt.show()

# 2️⃣ Director also being the writer (1 = Yes, 0 = No)
plt.figure(figsize=(6, 4))
ax = sns.barplot(x='Director_Is_Writer', y='Rating', data=df, palette=[TEXT, HIGHLIGHT])
plt.title('Average Rating: Director Also Writer')
plt.ylim(6.0, 8.5)
plt.xlabel('Is Director Writer?')
plt.ylabel('Avg IMDb Rating')
plt.tight_layout()
plt.show()